Putting Successor Variety Stemming to Work

Authors

  • Benno Stein
  • Martin Potthast
Abstract

Stemming algorithms find canonical forms for inflected words, e.g. for declined nouns or conjugated verbs. Since such a unification of words with respect to gender, number, tense, and case is language-specific, stemming algorithms operationalize a set of linguistically motivated rules for the language in question. The best-known rule-based algorithm for English is Porter's [14]. This paper presents a statistical stemming approach that is based on analyzing the distribution of word prefixes in a document collection and is therefore largely language-independent. In particular, our approach addresses the problem of index construction for multilingual documents. Related work on statistical stemming focuses either on stemming quality [2,3] or on runtime performance [11], but neither line of work provides a reasonable tradeoff between the two. For selected retrieval tasks under vector-based document models, we report new results on stemming quality and collection size dependency. Interestingly, successor variety stemming has neither been investigated under similarity concerns for index construction, nor is it applied as a technology in current retrieval applications. As our results show, this disregard is not justified.
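
To make the idea of analyzing word-prefix distributions concrete, the following is a minimal sketch of successor variety computation with a simple peak heuristic for choosing the stem boundary. It is an illustration under stated assumptions, not the authors' implementation: the function names, the end-of-word marker `$`, the `min_stem_len` parameter, the local-peak cut rule, and the toy vocabulary are all hypothetical choices made for this example.

```python
from collections import defaultdict


def build_successor_table(vocabulary):
    """Map every prefix seen in the vocabulary to the set of characters following it."""
    successors = defaultdict(set)
    for word in vocabulary:
        for i in range(len(word)):
            successors[word[:i]].add(word[i])
        # Assumption: a complete word counts as one extra successor ("$").
        successors[word].add("$")
    return successors


def successor_varieties(word, successors):
    """Successor variety for the prefixes word[:1], word[:2], ..., word."""
    return [len(successors.get(word[:i], ())) for i in range(1, len(word) + 1)]


def stem(word, successors, min_stem_len=2):
    """Cut after the first prefix whose successor variety is a local peak.

    Assumed heuristic: a prefix followed by many different characters is
    likely to end at a morpheme boundary.
    """
    sv = successor_varieties(word, successors)
    for i in range(min_stem_len, len(word)):
        # sv[i - 1] is the variety of the prefix of length i.
        if sv[i - 1] > sv[i - 2] and sv[i - 1] > sv[i]:
            return word[:i]
    return word


# Toy vocabulary standing in for the words of a document collection.
vocabulary = ["able", "ape", "beatable", "fixable", "read", "readable",
              "reading", "reads", "red", "rope", "ripe"]
table = build_successor_table(vocabulary)

print(successor_varieties("readable", table))  # [3, 2, 1, 4, 1, 1, 1, 1]
print(stem("readable", table))                 # read
print(stem("reading", table))                  # read
```

Because the table is built only from the words of the collection itself, no language-specific rules are involved. The quality of the cut depends on how many words the collection contains, which is the kind of collection size dependency the abstract refers to.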

Similar resources

Evaluation of N-grams Conflation Approach in Text-Based Information Retrieval

This paper examines a conflation method based on the N-grams approach and evaluates its performance relative to the results achieved by other techniques such as Porter algorithm and successor variety stemming. In addition to that, an alternative way of enhancing the N-grams method, derived from the concept of inverse frequency weighing, is introduced and evaluated. The experimental results gene...

A Literature Review and Discussion of Malay Rule-Based Affix Elimination Algorithms

Stemming is one of the techniques in natural language processing that is used to reduce a word to its root. Information retrieval and knowledge management can further be improved by improving the stemming process. There are four strategies that are being used widely in stemming that includes table lookup, rule-based affix elimination, successor variety and n-gram. However, not all of t...

Some Notes on Putting Formal Specifications to Productive Use

These notes are personal reflections, stemming from attempts to understand the sources of problems and successes in the application of work on formal specifications. Our intent is to provoke thought about the nature and value of work in the area; not to provide a set of well-tested results. Rather than focusing on yet another specification language, we have tried to take a broad view of the rol...

From “Manbearpig” to “Man bear pig”: An Evaluation of Unsupervised Word Segmentation Algorithms

In this paper, we explore diverse methods of unsupervised morphemic segmentation. We test Successor and Predecessor Count algorithms, Entropy algorithms, and Affix Discovery algorithms. The paper examines word stemming based on these algorithms, and the influence of training corpus size on segmentation accuracy. We propose variations on these algorithms to improve overall efficacy. While these ...

Improving Successor Variety for Morphological Segmentation

Successor variety is a commonly used measure for segmentation in language processing. It is based on a simple idea that large variety of letters (or phonemes) following an initial word (or utterance) segment indicates a possible boundary. It dates back to Harris (1955), and several methods based on successor variety have been used in the literature, particularly for the purposes of segmenting w...

Publication date: 2006